Understanding Climate Change Trends with Regression Analysis
11 May 2024
Intro to Linear Regression
Prof. Amr Amin
INTRODUCTION
Climate change is one of the most pressing issues facing our planet today.
Global temperatures are rising, sea levels are increasing, and extreme
weather events are becoming more frequent. These changes have significant
consequences for human societies and natural ecosystems.
This project investigates climate change trends using regression analysis, a
statistical technique that allows us to quantify the relationships between
variables. We will focus on analyzing historical data on global temperatures,
greenhouse gas concentrations, and other relevant factors.
Climate change is a complex phenomenon with varying
impacts across the globe. Understanding these diverse trends
requires a tool that allows for easy visualization and
comparison. This is where our innovative web application
steps in.
Our web app offers an interactive platform to explore
historical and potential future climate patterns. It empowers
you to:
Plot Climate Through the Years: Visualize
historical climate data, such as temperature,
precipitation, or humidity, for multiple countries
over selected time periods.
Comparative Analysis: Overlay climate data for
different countries on the same graph, enabling you
to compare trends and identify regional variations.
Interactive Exploration: Dive deeper by zooming
in on specific regions or timeframes to gain a more
detailed understanding of climate changes.
This user-friendly interface is designed for a wide audience,
from climate experts to students and policymakers. It fosters
a deeper understanding of:
Country-Specific Impacts: Identify how climate
change is affecting individual countries, aiding in
targeted adaptation strategies.
Global Climate Trends: Observe broader patterns
of climate change across the globe, highlighting
areas of concern.
The Power of Data Visualization: Experience
climate data come to life through interactive graphs,
promoting informed decision-making.
Data Cleaning and Preparation
Preparing the data for regression.
Insights:
Our raw data ranges from the year 1750 to 2013, with records of the average global
temperature for each month.
With 3,156 data entries, we ran into several problems with the data.
1. Too much seasonal variation
Since data collection focuses on the northern hemisphere more than the southern, we
see heavy fluctuation as the continents swing between summer and winter, making
regression analysis difficult.
This was fixed by taking the mean of all twelve months for each year in our dataset.
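In pandas, this annual averaging can be sketched as follows. This is a minimal sketch: the column names "dt" and "LandAverageTemperature" follow the Berkeley Earth CSV layout and are assumptions about our file's schema, and the tiny inline table stands in for the real data.

```python
import pandas as pd

# Stand-in for the real monthly dataset; column names are assumed.
df = pd.DataFrame({
    "dt": pd.to_datetime([
        "1850-01-01", "1850-07-01",
        "1851-01-01", "1851-07-01",
    ]),
    "LandAverageTemperature": [2.0, 14.0, 3.0, 15.0],
})

# Collapse each year's monthly records into one annual mean,
# smoothing out the summer/winter swings.
annual = (
    df.assign(year=df["dt"].dt.year)
      .groupby("year")["LandAverageTemperature"]
      .mean()
      .reset_index()
)
```

With a real file, the DataFrame construction would be replaced by a read_csv call on the dataset.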
2. High uncertainty in the period between 1750 and 1850
Probably due to the evolution of data collection and temperature measurement, our
records are not very robust for that period.
This leads us to our main preprocessing topic: outlier detection.
Outlier detection:
IQR:
This method will not work for our data, since the outliers in the 1750–1850 range do
not behave like the outliers in the 1950–2013 range.
Z-score:
This method is more robust, but it suffers from the same problem as the IQR method,
making it unsuitable.
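Both rules can be sketched in a few lines; the synthetic series below (a stable modern era plus a noisy early era) is an assumption used only to illustrate why a single global threshold struggles on our data.

```python
import numpy as np

rng = np.random.default_rng(0)
temps = rng.normal(8.5, 0.3, 200)        # stable modern-era readings
temps[:20] += rng.normal(0.0, 2.0, 20)   # noisier early-era readings

# IQR rule: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, q3 = np.percentile(temps, [25, 75])
iqr = q3 - q1
iqr_mask = (temps < q1 - 1.5 * iqr) | (temps > q3 + 1.5 * iqr)

# Z-score rule: flag points more than 3 standard deviations from the mean.
z = (temps - temps.mean()) / temps.std()
z_mask = np.abs(z) > 3

# Both rules apply one global threshold to the whole series, which is
# why they break down when the spread of the data changes over time.
```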
Robust Regression (RANSAC):
Robust regression is an alternative to least squares regression when data
are contaminated with outliers or influential observations, and it can also
be used for the purpose of detecting influential observations.
(https://stats.oarc.ucla.edu/r/dae/robust-regression/)
RANSAC (Random Sample Consensus) is a robust algorithm used in machine
learning and computer vision to estimate model parameters in the presence of
outliers. (https://medium.com/@chandu.bathula16/machine-learning-concept-69-random-sample-consensus-ransac-e1ae76e4102a)
This method suits outlier detection in our context well: it chooses random samples,
fits a regression on them, and removes the points with high error.
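A sketch of this step with sklearn's RANSACRegressor (which wraps an ordinary linear regression by default); the synthetic series with a noisy early era is an assumption standing in for our real data.

```python
import numpy as np
from sklearn.linear_model import RANSACRegressor

rng = np.random.default_rng(42)
years = np.arange(1750, 2014).reshape(-1, 1)
temps = 8.0 + 0.005 * (years.ravel() - 1750) + rng.normal(0, 0.1, len(years))
temps[:30] += rng.normal(0, 1.5, 30)   # simulate the noisy 1750-1850 era

# RANSAC repeatedly fits on random subsets, keeps the largest consensus
# set, and flags everything outside it as an outlier.
ransac = RANSACRegressor(random_state=0)
ransac.fit(years, temps)
inlier_mask = ransac.inlier_mask_
clean_years = years[inlier_mask]
clean_temps = temps[inlier_mask]
```

The retained (clean_years, clean_temps) pairs are then what the later regression models are fitted on.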
Our data after RANSAC:
Now we can apply all regression techniques without being affected by
inconsistent data.
Model Building
Choosing a model to best fit climate change
Simple Linear Model:
The simple linear model is the basis for linear regression, so naturally it was the first
model to test with our data:
After fitting the model to our dataset with the sklearn Python library, the resulting
line predicts a temperature of 2.63 °C at year 0 AD. The model's integrity remained
strong for predicting the Earth's temperature between 1850 and 1950 but weakened
in recent years, resulting in a low correlation coefficient of 22.1%.
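The fitting step looks roughly like this; the synthetic series below is an assumption standing in for the cleaned data, so the fitted numbers will differ from ours.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the cleaned annual series.
rng = np.random.default_rng(1)
years = np.arange(1850, 2014).reshape(-1, 1)
temps = 8.0 + 0.004 * (years.ravel() - 1850) + rng.normal(0, 0.2, len(years))

model = LinearRegression().fit(years, temps)
slope, intercept = model.coef_[0], model.intercept_
r2 = model.score(years, temps)   # coefficient of determination
# intercept is the model's (extrapolated) temperature at year 0 AD
```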
Polynomial regression:
Since our model has only one independent variable (the year, with temperature as
the response), multivariate regression over multiple features is either impossible or
would not yield useful insights. This limits us to polynomial regression.
Polynomial regression of degree 2 already produced much better results: the fitted
quadratic achieved a correlation coefficient of 39.6%.
After testing, we found that any polynomial degree higher than 2 resulted in
overfitting.
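In sklearn, the quadratic fit can be expressed as a pipeline; the synthetic quadratic trend below is an assumption for illustration, not our actual coefficients.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
years = np.arange(1850, 2014).reshape(-1, 1)
t = years.ravel() - 1850
# Synthetic stand-in with a mild quadratic trend.
temps = 8.0 + 0.002 * t + 0.00005 * t**2 + rng.normal(0, 0.2, len(t))

# PolynomialFeatures expands [x] into [1, x, x^2]; LinearRegression
# then fits the quadratic coefficients by least squares.
quad = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
quad.fit(years, temps)
r2 = quad.score(years, temps)
pred_2020 = quad.predict([[2020]])[0]
```

Raising the degree argument beyond 2 is all it takes to reproduce the overfitting behavior described above.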
Drawbacks of this model:
Like most fitted models, this one is much better at predicting values near its training
range; where it fails is in extrapolating to very old values, which is to be expected of
a fitted equation. For example, the predicted global temperature for the year 1000 AD
is 37 °C, which is far from the truth, but the predicted temperature for 2020 was
9.45 °C, very close to the actual value of 9.55 °C reported by Berkeley Earth
(https://berkeleyearth.org/data/).
Even though our application predicts values in a deterministic fashion, we use several
non-deterministic methods, including RANSAC and the train/test split, so any change
in the random seed will cause slightly different results.
After cleaning:
Building the model on the raw data was not optimal, but after cleaning, the model
remained much more robust than the general linear model, with far more accurate
predictions. The Earth's temperature is not increasing in a linear way: the higher the
temperature, the more greenhouse gases are released into our atmosphere, due to
many factors including gas expansion and the growth of methane-producing bacteria.
This is evident in our data's positive β₂ coefficient.
Depressing insight:
According to our model, by the time most of us are ~70 the global average
temperature will be 10.53 °C, meaning sea levels will have risen significantly,
flooding most coastal cities and severely damaging our ecosystem.
Statistical Analysis:
In the last section we found that the degree of the polynomial fitting function highly
impacts the results, and that a degree of 2 is the best option. But how can we prove
that it is significant? By doing an ANOVA test, of course.
To calculate our F₀ we use the extra-sum-of-squares formula
F₀ = SSR(β₂ | β₁, β₀) / MSE,
which compares the error reduction from adding the x² term against the quadratic
model's mean squared error.
This shows, with 95% confidence, that our x² coefficient is significant.
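A sketch of this partial F-test, under the assumption that F₀ compares the error reduction from adding the squared term against the quadratic model's mean squared error; the synthetic series stands in for our data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
t = np.arange(164, dtype=float)            # years since 1850
y = 8.0 + 0.002 * t + 0.00005 * t**2 + rng.normal(0, 0.2, len(t))

def sse(degree):
    """Sum of squared errors of a polynomial fit of the given degree."""
    coeffs = np.polyfit(t, y, degree)
    return float(np.sum((y - np.polyval(coeffs, t)) ** 2))

n = len(t)
# Extra-sum-of-squares statistic for the x^2 term:
# F0 = [SSE(linear) - SSE(quadratic)] / 1, divided by MSE(quadratic).
mse_full = sse(2) / (n - 3)
f0 = (sse(1) - sse(2)) / mse_full
f_crit = stats.f.ppf(0.95, 1, n - 3)       # 95% critical value
significant = f0 > f_crit
```

When f0 exceeds the critical value, the quadratic coefficient is significant at the 95% level.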
Do we have a problem in our model?
This can be checked simply by plotting our residuals.
Since our residuals don't follow a pattern and are uncorrelated, we know that our
model doesn't have any issues that need to be solved.
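The residual plot can be produced like this; the patternless synthetic residuals below are an assumption standing in for (observed - fitted) values from the quadratic model.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")                      # headless backend for scripts
import matplotlib.pyplot as plt

# Synthetic residuals; patternless scatter around zero is what we
# want to see from a well-specified model.
rng = np.random.default_rng(4)
years = np.arange(1850, 2014)
residuals = rng.normal(0.0, 0.2, len(years))

plt.scatter(years, residuals, s=10)
plt.axhline(0, color="red", linewidth=1)
plt.xlabel("Year")
plt.ylabel("Residual (°C)")
plt.title("Residuals vs. year")
plt.savefig("residuals.png")
```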
In linear regression, residuals following a pattern
violate one of the key assumptions of the model,
leading to unreliable results. Here's why it's
problematic:
1. Non-random Errors: Linear regression assumes that the errors, represented by the
residuals, are random and independent. This means the errors shouldn't be predictable
based on the independent variables used in the model. If the residuals follow a pattern,
it suggests there's a systematic error in your model that the current variables can't
capture.
2. Underestimated Variability: When residuals follow a pattern, it indicates the
model is missing some important information that explains the data's variability. This
can lead to underestimating the true variance in the data, making your confidence
intervals for the regression coefficients too narrow. In simpler terms, the model
appears more accurate than it actually is.
3. Biased Coefficients: In some cases, patterned residuals can bias the estimated
coefficients of the independent variables. This means the relationships between the
independent variables and the dependent variable might be misrepresented.
Error Metrics:
Confidence Intervals:
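The metrics and interval named above can be computed as follows; the synthetic true/predicted values are assumptions standing in for our model's output, and the interval shown is a normal-approximation confidence interval for the mean residual.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Synthetic stand-ins for observed and predicted temperatures.
rng = np.random.default_rng(5)
trend = 8.0 + 0.01 * np.arange(100)
y_true = trend + rng.normal(0.0, 0.2, 100)
y_pred = trend

mae = mean_absolute_error(y_true, y_pred)
rmse = mean_squared_error(y_true, y_pred) ** 0.5
r2 = r2_score(y_true, y_pred)

# 95% confidence interval for the mean residual, assuming roughly
# normal errors.
resid = y_true - y_pred
half = 1.96 * resid.std(ddof=1) / np.sqrt(len(resid))
ci = (resid.mean() - half, resid.mean() + half)
```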
"Climate change is the most severe
problem that we are facing today, more
serious even than the threat of terrorism."
- Stephen Hawking, physicist and
cosmologist
RIP 1942–2018